Introduction

Forest cover type and land cover play a key role in environmental assessment. Accurate information about natural resources is important to many entities, including private parties, local governments, federal agencies, and conservation agencies. Land cover data are normally generated from remote sensing data. However, those datasets can be difficult and costly to process. We can therefore try to use cartographic data to predict forest cover types. Various supervised classification algorithms can be applied to this dataset, including K-Nearest Neighbors, Support Vector Machines, tree-based methods, and neural networks. In this project, we will try as many methods as we can and compare the results. The original effort to classify forest cover type on this dataset achieved 70.52% classification accuracy using an artificial neural network.

Problem Description

The main problem we are trying to solve is how to predict forest cover type from cartographic information. Which model performs best on this classification task? Can we apply the same model to different regions? The results can feed into other analyses such as fire hazard prevention, natural asset management, and climate change studies.

Another problem I am trying to address here is the trade-off of using remote sensing data. In many cases, remote sensing data is useful and beneficial. It can cover large areas, including places humans cannot reach in person. It also has a temporal element, allowing us to see how the environment changes over time. However, it also comes with disadvantages. The files are very large, so they require substantial processing power. Remote sensing can also be disrupted by other phenomena such as the weather. The main goal here is to see whether we can predict tree type using only cartographic information. If I have time, I will also try to combine remote sensing data with basic cartographic information. Would the predictions improve with data from different origins and dimensions?

Objective

This project is a supervised classification task. We will randomly divide our dataset into training and testing sets. Cross-validation will be performed on the training set to select the best hyperparameters for each model. The main objective of this project is to find a method with high test accuracy for classifying forest cover types. The second objective is to achieve a high score in the Kaggle competition. The third objective is to use remote sensing data to increase the overall accuracy of the best model.
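The split-then-cross-validate workflow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the data are synthetic (`make_classification`), the decision tree and the `candidate_depths` values are placeholders, and the 80/20 split ratio is assumed.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the forest cover data (7 classes), not the real dataset.
X, y = make_classification(
    n_samples=700, n_features=10, n_informative=6,
    n_classes=7, n_clusters_per_class=1, random_state=0,
)

# Hold out a test set that is never touched during tuning (assumed 80/20 split).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Cross-validate each candidate hyperparameter value on the training set only.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
candidate_depths = [2, 4, 8]  # placeholder hyperparameter values
cv_scores = {
    depth: cross_val_score(
        DecisionTreeClassifier(max_depth=depth, random_state=0),
        X_train, y_train, cv=cv, scoring="accuracy",
    ).mean()
    for depth in candidate_depths
}
best_depth = max(cv_scores, key=cv_scores.get)

# Refit on the full training set and score once on the held-out test set.
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
final_model.fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)
print(best_depth, round(test_accuracy, 3))
```

Keeping the test set out of the tuning loop means the final accuracy is an estimate on data that played no part in choosing the hyperparameters.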

Data Description

Data Source

This dataset was retrieved from the UCI Machine Learning Repository. It originally comes from Jock A. Blackard of the USFS and Dr. Denis J. Dean of UT Dallas. The actual forest cover type for a given observation (30 x 30 meter cell) was determined from US Forest Service (USFS) Region 2 Resource Information System (RIS) data. Independent variables were derived from data originally obtained from the US Geological Survey (USGS) and the USFS. There are seven forest cover type classes: lodgepole pine, spruce/fir, ponderosa pine (Pinus ponderosa), Douglas-fir, aspen, cottonwood/willow, and krummholz.

Description and Basic Statistics of the Data

Here is a description of all columns in this dataset.

Elevation - Elevation in meters \
Aspect - Aspect in degrees azimuth \
Slope - Slope in degrees \
Horizontal_Distance_To_Hydrology - Horizontal distance to nearest surface water features \
Vertical_Distance_To_Hydrology - Vertical distance to nearest surface water features \
Horizontal_Distance_To_Roadways - Horizontal distance to nearest roadway \
Hillshade_9am (0 to 255 index) - Hillshade index at 9am, summer solstice \
Hillshade_Noon (0 to 255 index) - Hillshade index at noon, summer solstice \
Hillshade_3pm (0 to 255 index) - Hillshade index at 3pm, summer solstice \
Horizontal_Distance_To_Fire_Points - Horizontal distance to nearest wildfire ignition points \
Wilderness_Area (4 binary columns, 0 = absence or 1 = presence) - Wilderness area designation \
Soil_Type (40 binary columns, 0 = absence or 1 = presence) - Soil type designation \
Cover_Type (7 types, integers 1 to 7) - Forest cover type designation

The 7 different cover types are classified as: \
1 - Spruce/Fir \
2 - Lodgepole Pine \
3 - Ponderosa Pine \
4 - Cottonwood/Willow \
5 - Aspen \
6 - Douglas-fir \
7 - Krummholz
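As a small illustration of how the integer target and the binary soil columns could be made easier to read, here is a sketch using pandas. The frame below is hypothetical (made-up values, only three of the 40 soil columns), and the column names `Soil_Type1`, `Soil_Type2`, ... are an assumption about how the binary columns are labeled.

```python
import pandas as pd

# Code-to-name mapping taken from the cover type list above.
COVER_TYPE_NAMES = {
    1: "Spruce/Fir", 2: "Lodgepole Pine", 3: "Ponderosa Pine",
    4: "Cottonwood/Willow", 5: "Aspen", 6: "Douglas-fir", 7: "Krummholz",
}

# Tiny hypothetical frame with the same layout as the real data
# (only 3 of the 40 soil columns, made-up values).
df = pd.DataFrame({
    "Elevation": [2596, 2804, 3100],
    "Soil_Type1": [0, 1, 0],
    "Soil_Type2": [1, 0, 0],
    "Soil_Type3": [0, 0, 1],
    "Cover_Type": [5, 2, 1],
})

# Attach readable labels to the integer target.
df["Cover_Name"] = df["Cover_Type"].map(COVER_TYPE_NAMES)

# Collapse the one-hot soil columns into a single integer category.
soil_cols = [c for c in df.columns if c.startswith("Soil_Type")]
df["Soil_Type"] = (
    df[soil_cols].idxmax(axis=1)
    .str.replace("Soil_Type", "", regex=False)
    .astype(int)
)
print(df[["Elevation", "Soil_Type", "Cover_Name"]])
```

A single categorical soil column is convenient for plots and for methods that handle mixed data; most classifiers can still use the original binary columns directly.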

Exploratory Data Analysis

In this section, I will perform some analysis and basic visualization of the tree cover type dataset.

There are 15,120 records in the training set, and the test set is available on Kaggle. This is enough data for us to train and validate.

Here is a look at the first 5 rows of data.

Some attribute names are too long, so I renamed some columns.

First, we get the basic information and statistics for each variable.

From the pair plot below, we can see that elevation separates forest cover types best, especially when combined with aspect and slope. There are a few outliers in the graph; we will look into them to see whether they are measurement errors.

There is no missing data in this dataset.
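The inspection steps in this section (renaming long columns, summary statistics, and the missing-value check) might look like the sketch below. The frame is a hypothetical miniature with made-up values, and the shortened column names are illustrative only; the names actually chosen in the project are not stated.

```python
import pandas as pd

# Hypothetical miniature frame; the real data would be loaded with pd.read_csv.
df = pd.DataFrame({
    "Elevation": [2596, 2804, 3100, 2950],
    "Horizontal_Distance_To_Hydrology": [258, 212, 30, 175],
    "Horizontal_Distance_To_Roadways": [510, 390, 3090, 1200],
    "Cover_Type": [5, 2, 1, 2],
})

# Shorten long column names (these short names are illustrative assumptions).
df = df.rename(columns={
    "Horizontal_Distance_To_Hydrology": "HD_Hydrology",
    "Horizontal_Distance_To_Roadways": "HD_Roadways",
})

summary = df.describe()               # count, mean, std, min, quartiles, max
missing_per_column = df.isna().sum()  # number of missing values per column
total_missing = int(missing_per_column.sum())

print(summary.loc[["mean", "min", "max"]])
print("missing values:", total_missing)
```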

Unsupervised Learning Results

Principal Component Analysis

PCA is designed for continuous variables and is not well suited to binary data, which make up most of the columns here. In my opinion the PCA results are not ideal: the principal components show no clear pattern separating the different tree species.

PCA performed poorly on this dataset. Let's take a look at the first three principal components in a 3-D plot.
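For reference, extracting the first three principal components and their explained variance can be sketched with scikit-learn as below. The input is synthetic (three continuous columns plus four one-hot columns) and only mimics the mix of variable types; standardizing before PCA is an assumption about the preprocessing.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 3 continuous columns plus 4 one-hot (binary) columns.
continuous = rng.normal(size=(200, 3))
one_hot = np.eye(4)[rng.integers(0, 4, size=200)]
X = np.hstack([continuous, one_hot])

# Standardize so no column dominates purely because of its scale.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=3)
components = pca.fit_transform(X_scaled)  # shape (200, 3), ready for a 3-D scatter

explained = pca.explained_variance_ratio_
print(explained.round(3), round(float(explained.sum()), 3))
```

The sum of `explained_variance_ratio_` shows how much of the total variance the 3-D plot actually represents; a low value means the plot hides most of the structure.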

FAMD

PCA did not quite do the job. Factor analysis of mixed data (FAMD) is an alternative to PCA that handles continuous and categorical variables together. FAMD clearly gives better results than PCA here, with a clearer pattern for separating the forest cover types. I created a similar 3-D plot from the FAMD results, and it looks good. However, the explained variability is still low, and the classes still overlap.
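To make the idea behind FAMD concrete, here is a hand-rolled sketch of one common formulation: continuous columns are standardized, each category indicator is divided by the square root of its proportion and centered, and PCA is then run on the combined matrix. This is an illustration written with NumPy only, not the library call used in the project; the function name `famd_coordinates` and the random inputs are hypothetical.

```python
import numpy as np

def famd_coordinates(continuous, categories, n_components=3):
    """Hand-rolled FAMD sketch: scale both variable types, then PCA via SVD."""
    continuous = np.asarray(continuous, dtype=float)
    # Continuous block: standardize to zero mean and unit variance.
    z = (continuous - continuous.mean(axis=0)) / continuous.std(axis=0)

    # Categorical block: indicator columns divided by sqrt(category proportion),
    # then centered, so each categorical variable is weighted like a continuous one.
    blocks = [z]
    for column in np.asarray(categories).T:
        levels = np.unique(column)
        indicators = (column[:, None] == levels[None, :]).astype(float)
        proportions = indicators.mean(axis=0)
        scaled = indicators / np.sqrt(proportions)
        blocks.append(scaled - scaled.mean(axis=0))
    matrix = np.hstack(blocks)

    # PCA of the combined matrix through singular value decomposition.
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    explained_ratio = s**2 / np.sum(s**2)
    return u[:, :n_components] * s[:n_components], explained_ratio[:n_components]

rng = np.random.default_rng(0)
continuous = rng.normal(size=(100, 3))
categories = rng.integers(0, 4, size=(100, 2))  # two made-up categorical variables
coords, ratio = famd_coordinates(continuous, categories)
print(coords.shape, ratio.round(3))
```

For this dataset, the binary wilderness and soil columns would first be collapsed into two categorical variables before being passed as `categories`.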

Classification Results

Our first step is to split the data into training/validation and test sets.

This is a multi-class classification task. I included logistic regression, linear discriminant analysis, K-nearest neighbors, decision tree, Gaussian naive Bayes, and support vector machine classifiers.

K-nearest neighbors has the best performance in the algorithm comparison, so our next step is to improve its performance by tuning the hyperparameters. I will also add XGBoost and a multilayer artificial neural network.
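A baseline comparison of the classifiers listed above could be sketched as follows. The data are synthetic, default hyperparameters are used throughout, and the scaling pipelines and 5-fold setup are assumptions rather than the project's exact configuration.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic 7-class stand-in for the forest cover data.
X, y = make_classification(
    n_samples=700, n_features=10, n_informative=6,
    n_classes=7, n_clusters_per_class=1, random_state=0,
)

# Scale-sensitive models get a StandardScaler in front of them.
models = {
    "Logistic regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "LDA": LinearDiscriminantAnalysis(),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Gaussian NB": GaussianNB(),
    "SVM": make_pipeline(StandardScaler(), SVC()),
}

# Mean 5-fold cross-validated accuracy per model.
mean_accuracy = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}
for name, score in sorted(mean_accuracy.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:20s} {score:.3f}")
```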

KNN

K-nearest neighbors is a simple classification technique. A few hyperparameters need to be tuned, including the number of neighbors, the leaf size, the power parameter of the Minkowski distance, and the weights.

The overall accuracy is good at 0.85, but the accuracy on lodgepole pine and spruce/fir can be improved. Both have a low recall score because many of them are classified as each other or as aspen.

I also plotted a multiclass ROC curve with the yellowbrick package; the curves look close to optimal for all forest cover types.
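A grid search over the four hyperparameters named above can be sketched with scikit-learn's `GridSearchCV`. The grid values, the scaling step, and the synthetic data are all assumptions made for illustration; the actual search ranges used in the project are not given.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 7-class stand-in for the forest cover data.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=6,
    n_classes=7, n_clusters_per_class=1, random_state=0,
)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Illustrative grid; the values are placeholders, not the project's ranges.
param_grid = {
    "knn__n_neighbors": [1, 3, 5, 9],
    "knn__leaf_size": [10, 30],
    "knn__p": [1, 2],                       # 1 = Manhattan, 2 = Euclidean
    "knn__weights": ["uniform", "distance"],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Note that `leaf_size` only affects the speed and memory use of the tree-based neighbor search, not the predictions, so it matters for runtime rather than accuracy.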
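Per-class recall is read off the confusion matrix as the diagonal divided by the row sums. The sketch below uses a handful of made-up labels for three of the cover types (not the project's predictions) to show how confusion between classes lowers recall.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

labels = ["spruce/fir", "lodgepole pine", "aspen"]

# Made-up true and predicted labels, for illustration only.
y_true = ["spruce/fir"] * 4 + ["lodgepole pine"] * 4 + ["aspen"] * 2
y_pred = [
    "spruce/fir", "spruce/fir", "lodgepole pine", "aspen",      # true spruce/fir
    "lodgepole pine", "lodgepole pine", "spruce/fir", "aspen",  # true lodgepole pine
    "aspen", "aspen",                                           # true aspen
]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Recall per class: correctly predicted samples / all samples of that class.
recall_from_cm = cm.diagonal() / cm.sum(axis=1)
recall = recall_score(y_true, y_pred, labels=labels, average=None)

print(cm)
print(dict(zip(labels, recall.round(2))))
```

In this toy example, half of the spruce/fir and lodgepole pine samples end up in the wrong column, so their recall is 0.5 while aspen's is 1.0.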
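The quantities behind a multiclass ROC plot can also be computed directly with scikit-learn, which is swapped in here in place of yellowbrick. This sketch treats each class one-vs-rest on synthetic three-class data; the KNN settings and the data are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import label_binarize

# Synthetic 3-class stand-in for the forest cover data.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=6,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
proba = model.predict_proba(X_test)  # one probability column per class

# One-vs-rest: binarize the labels and build one ROC curve per class.
classes = model.classes_
y_test_bin = label_binarize(y_test, classes=classes)
curves, per_class_auc = {}, {}
for i, cls in enumerate(classes):
    fpr, tpr, _ = roc_curve(y_test_bin[:, i], proba[:, i])
    curves[int(cls)] = (fpr, tpr)  # points to plot for this class
    per_class_auc[int(cls)] = roc_auc_score(y_test_bin[:, i], proba[:, i])

# Macro average: unweighted mean of the per-class one-vs-rest AUCs.
macro_auc = roc_auc_score(y_test, proba, multi_class="ovr", average="macro")
print(per_class_auc, round(macro_auc, 3))
```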

XGBoost

XGBoost is a tree-based method that performs well on many high-dimensional classification tasks.

This is an optimization history plot showing all trials as well as the best score at each point.

The accuracy of the model is 87%.

XGBoost is definitely better than KNN at classifying this dataset: we see fewer classification errors on lodgepole pine and spruce/fir, and they are no longer likely to be classified as aspen.
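The "best score at each point" line in such a plot is simply the running maximum of the trial scores. A minimal sketch with made-up trial scores (not the project's tuning results):

```python
from itertools import accumulate

# Made-up validation accuracies, one per tuning trial (not the project's results).
trial_scores = [0.71, 0.78, 0.74, 0.83, 0.80, 0.86, 0.85]

# Best score seen up to and including each trial.
best_so_far = list(accumulate(trial_scores, max))

# Index of the trial that produced the overall best score.
best_trial = max(range(len(trial_scores)), key=trial_scores.__getitem__)

print(best_so_far)
print("best trial:", best_trial, "score:", trial_scores[best_trial])
```

A curve that flattens early suggests that additional trials are unlikely to improve the score much.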

Neural Network

Artificial neural networks usually underperform on structured data, but we will give one a try here.

Here we use Keras Tuner to optimize the number of layers, the number of neurons in each layer, and the learning rate.

It achieves 77% accuracy. The performance is not quite as good as XGBoost, but it is on par with K-nearest neighbors. However, since K-nearest neighbors is computationally much simpler, it is definitely a better choice than the ANN.
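The same kind of search (number of layers, neurons per layer, learning rate) can be illustrated with scikit-learn's `MLPClassifier` and `RandomizedSearchCV`. This is a stand-in for Keras Tuner, not the project's code: the candidate architectures, learning rates, and synthetic data are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 4-class stand-in for the forest cover data.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=6,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("mlp", MLPClassifier(max_iter=300, random_state=0)),
])

# Each tuple is one architecture: its length is the number of hidden layers,
# its entries are the neurons per layer. Values are placeholders.
search_space = {
    "mlp__hidden_layer_sizes": [(32,), (64,), (32, 32), (64, 32)],
    "mlp__learning_rate_init": [1e-3, 3e-3, 1e-2],
}

# Try 5 randomly chosen combinations with 3-fold cross-validation.
search = RandomizedSearchCV(
    pipeline, search_space, n_iter=5, cv=3, scoring="accuracy", random_state=0
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```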

Conclusion

Overall, I was able to achieve good prediction accuracy on the test dataset with the K-nearest neighbors algorithm and XGBoost. XGBoost in particular increased the precision and recall for spruce/fir and lodgepole pine, and reached an accuracy of 87.8%. The neural network, on the other hand, achieved an accuracy of 77%. The main difficulty in classifying lodgepole pine is that this species has the widest range of environmental tolerance of any conifer in North America. It adapts to different climates, being a minor species in warm, moist places and a dominant one in cold, dry places.

To further improve accuracy, I think introducing weather data might help reduce these misclassifications.